Inspiration:
The central question for this project is: Which physicochemical component influences wine quality the most? Each component alters the quality, but there could be one that stands out and plays a heavy role in determining a wine’s quality. Additionally, we would like to see what a wine’s input component values look like at each quality level, and whether certain components of wine have a “relationship” (a tendency to use more or less of one component when a different component is added).
About the data:
Our data relate to the red and white variants of the Portuguese “Vinho Verde” wine and were collected by a team of scientists who used machine learning to try to predict human wine taste preferences from the contents of the wine. The data set includes physicochemical (input) and sensory (output) variables. The input variables are:
* fixed acidity - most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
* volatile acidity - the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste
* citric acid - found in small quantities, citric acid can add ‘freshness’ and flavor to wines
* residual sugar - the amount of sugar remaining after fermentation stops; it is rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
* chlorides - the amount of salt in the wine
* free sulfur dioxide - the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
* total sulfur dioxide - the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
* density - the density of wine is close to that of water, depending on the percent alcohol and sugar content
* pH - describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale
* sulphates - a wine additive that can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
* alcohol - the percent alcohol content of the wine
The output variable is:
* quality - output variable (based on sensory data, score between 0 and 10)
Data modifications:
We first created a new data set, mutated_data, by mutating the original data set to change the quality values to character values. Secondly, we created another data set, wine2, that summarizes the original data set with the mean and standard deviation of each input variable. Lastly, the third data set, wine3, takes the original data set, groups it by quality value, and again summarizes the mean and standard deviation of each input component. wine3 is also used to create smaller data sets, filtered by quality level, for later use.
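The three derived data sets described above can be sketched roughly as follows. This is a minimal base-R illustration, not the report's actual code: the small `wine` data frame and the name `wine3_q6` are hypothetical stand-ins, and the wording above (mutate, group, summarize) suggests the original may have used dplyr instead.

```r
# Hypothetical stand-in for the wine data: two input columns plus quality.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  pH      = c(3.51, 3.20, 3.30, 3.16, 3.40, 3.35),
  quality = c(5L, 5L, 6L, 6L, 7L, 7L)
)

# mutated_data: same data, with quality stored as character values.
mutated_data <- transform(wine, quality = as.character(quality))

# wine2: mean and standard deviation of every input variable.
inputs <- setdiff(names(wine), "quality")
wine2 <- data.frame(
  variable  = inputs,
  mean      = sapply(wine[inputs], mean),
  sd        = sapply(wine[inputs], sd),
  row.names = NULL
)

# wine3: the same summaries, computed within each quality level.
# Each input becomes a two-column matrix with columns "mean" and "sd".
wine3 <- aggregate(
  . ~ quality, data = wine,
  FUN = function(x) c(mean = mean(x), sd = sd(x))
)

# One of the smaller per-quality data sets (quality level 6 here).
wine3_q6 <- wine3[wine3$quality == 6, ]
```

Keeping `quality` as a character (or factor) column is mainly useful for plotting, where it makes each quality level a discrete group rather than a point on a continuous axis.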
Data citation:
Learning, UCI Machine. “Red Wine Quality.” Kaggle, 27 Nov. 2017, www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009.
Mission:
We start by examining the output component, quality, using various methods. Then, each input component is examined individually to show its features. A correlation chart is then created to determine whether there are any relationships between the components. Lastly, a linear regression is used to determine which variables influence wine quality the most.
First, we take a look at the output variable, quality
A bar graph is created to see the number of wines that fall within each quality value.
Next, the data is grouped by quality and the mean of each component is calculated for each quality level. These means are graphed to display trends that occur with increasing quality.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Next, each of the input variables is examined in detail
Each input variable is examined in three ways: a count of its values across the range of usage, a box plot of its values at each quality level, and density distributions separated by quality value. Together these give us a good look at the most common values overall and at each quality level.
Fixed Acidity
Volatile Acidity
Citric Acid
Residual Sugar
Chlorides
Free Sulfur Dioxide
Total Sulfur Dioxide
Density
pH
Sulphates
Alcohol
Thirdly, a correlation chart is created to unveil relationships between various components
The correlation matrix of the wine data set is computed and rounded. It is then melted into three columns: Var1 (the first variable of each pair), Var2 (the second variable of each pair, drawn from the same set of variables because the matrix is square), and the correlation value. These are then plotted using ggplot.
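The melt step can be illustrated with a short sketch. The `wine` data frame below is a hypothetical three-column stand-in, and base R's `as.table()` is used in place of a melt function so the example needs no extra packages; it yields the same Var1, Var2, and value layout. The plot is only built when ggplot2 happens to be installed.

```r
# Hypothetical stand-in for the wine data (three numeric columns).
wine <- data.frame(
  fixed_acidity = c(7.4, 7.8, 11.2, 7.9, 6.7, 8.5),
  pH            = c(3.51, 3.20, 3.16, 3.30, 3.58, 3.28),
  alcohol       = c(9.4, 9.8, 9.8, 10.5, 11.0, 10.1)
)

# Correlation matrix, rounded to two decimals for display.
cor_matrix <- round(cor(wine), 2)

# Melt to long form: one row per (Var1, Var2) pair.
melted <- as.data.frame(as.table(cor_matrix), responseName = "value")

# Heat map of the pairs; skipped when ggplot2 is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(melted, aes(x = Var1, y = Var2, fill = value)) +
    geom_tile() +
    scale_fill_gradient2(limits = c(-1, 1))
}
```

Because the matrix is symmetric, every pair appears twice in `melted` (once in each order), and the diagonal rows always carry a correlation of 1.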
Citric acid and fixed acidity have a relatively strong relationship, as do density and fixed acidity, and total sulfur dioxide and free sulfur dioxide. In addition, pH and fixed acidity, pH and citric acid, and alcohol and density have moderately strong inverse relationships. It is also worth noting that alcohol has the strongest relationship with quality, which is examined further next.
We could have determined the most influential components from their relationship to quality in the correlation chart, but to increase our confidence, we normalize the variables and use a linear regression model.
Lastly, a linear regression model is used with the normalized variable values to determine the most important components in wine (those that alter its quality the most when all components are changed by the same amount on the normalized scale)
The data set wine is first normalized using this equation:
\[ x_{norm} = \left(\frac {x - x_{min}}{x_{max} - x_{min}} \right) \]
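As a small illustration of the equation above, min-max normalization can be written as a one-line function. The function name `normalize` and the sample values are assumptions made for this sketch; only the `normal_` prefix is taken from the model output in this report.

```r
# Min-max scaling: maps the smallest value to 0 and the largest to 1.
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

alcohol <- c(8.4, 9.5, 10.2, 11.1, 14.9)   # hypothetical values
normal_alcohol <- normalize(alcohol)
range(normal_alcohol)
```

Putting every input on the same 0 to 1 scale is what makes the regression slopes comparable: each slope then describes the change in quality as a variable moves across its whole observed range.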
Afterward, the normalized values are entered into the linear regression model and the fit is summarized:
##
## Call:
## lm(formula = quality ~ ., data = normalized_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68911 -0.36652 -0.04699 0.45202 2.02498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.7126 0.1509 37.846 < 2e-16 ***
## normal_fixed_acidity 0.2824 0.2932 0.963 0.3357
## normal_volatile_acidity -1.5820 0.1768 -8.948 < 2e-16 ***
## normal_citric_acid -0.1826 0.1472 -1.240 0.2150
## normal_residual_sugar 0.2384 0.2190 1.089 0.2765
## normal_chlorides -1.1227 0.2512 -4.470 8.37e-06 ***
## normal_free_sulfur_dioxide 0.3097 0.1542 2.009 0.0447 *
## normal_total_sulfur_dioxide -0.9239 0.2062 -4.480 8.00e-06 ***
## normal_density -0.2435 0.2946 -0.827 0.4086
## normal_pH -0.5253 0.2433 -2.159 0.0310 *
## normal_sulphates 1.5303 0.1909 8.014 2.13e-15 ***
## normal_alcohol 1.7953 0.1721 10.429 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared: 0.3606, Adjusted R-squared: 0.3561
## F-statistic: 81.35 on 11 and 1587 DF, p-value: < 2.2e-16
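For readers who want to reproduce the shape of this fit, the sketch below runs the same `lm(quality ~ ., data = normalized_wine)` call on simulated data. The two predictors, their true slopes, and the noise level are invented for illustration and are not the wine measurements.

```r
set.seed(1)
n <- 200

# Simulated, already-normalized predictors (hypothetical, not the wine data).
normalized_wine <- data.frame(
  normal_alcohol          = runif(n),
  normal_volatile_acidity = runif(n)
)

# Simulated response with a positive and a negative slope plus noise.
normalized_wine$quality <- 5 + 2 * normalized_wine$normal_alcohol -
  1.5 * normalized_wine$normal_volatile_acidity + rnorm(n, sd = 0.5)

# Same call as in the output above: regress quality on every other column.
fit <- lm(quality ~ ., data = normalized_wine)
summary(fit)

# The Estimate column of the summary, as a named vector.
slopes <- coef(fit)
```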
In conclusion, the estimates are the slopes of the model: the greater the absolute value of a slope, the more the quality of a wine changes when that variable is increased or decreased. The components are ranked below in order, with their slope values included:
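Ranking of most important physicochemical components:
1. alcohol (1.7953)
2. volatile acidity (-1.5820)
3. sulphates (1.5303)
4. chlorides (-1.1227)
5. total sulfur dioxide (-0.9239)
6. pH (-0.5253)
7. free sulfur dioxide (0.3097)
8. fixed acidity (0.2824)
9. density (-0.2435)
10. residual sugar (0.2384)
11. citric acid (-0.1826)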
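The ranking by absolute slope can be reproduced directly from the coefficient table. In this sketch the estimates are copied by hand from the summary output above; the vector names `estimates` and `ranking` are chosen here for illustration.

```r
# Slope estimates copied from the regression summary (intercept omitted).
estimates <- c(
  fixed_acidity        =  0.2824,
  volatile_acidity     = -1.5820,
  citric_acid          = -0.1826,
  residual_sugar       =  0.2384,
  chlorides            = -1.1227,
  free_sulfur_dioxide  =  0.3097,
  total_sulfur_dioxide = -0.9239,
  density              = -0.2435,
  pH                   = -0.5253,
  sulphates            =  1.5303,
  alcohol              =  1.7953
)

# Order by absolute value, largest first; the signs are kept for reference.
ranking <- estimates[order(abs(estimates), decreasing = TRUE)]
ranking
```

Sorting on the absolute value treats a strongly negative slope (such as volatile acidity) as just as influential as a strongly positive one, which matches how the ranking is described above.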
With these variables ranked, it is clear that alcohol content is crucial to quality, but volatile acidity, sulphates, chlorides, and total sulfur dioxide are also major contributors to a wine’s quality. These components must therefore be weighed heavily when making wine in order to achieve a high-quality drink.
This raises two last questions: Should the top five ranked physicochemical components be the only ones considered, given the sharp drop-off in their effect on wine quality after total sulfur dioxide? Will increasing or decreasing the amount of an ingredient always alter the quality in the same way?
Based on our data analysis, alcohol is the most critical physicochemical component when it comes to wine quality. However, the components ranked two to five play a critical role in a wine’s quality as well. After the fifth component, the remaining inputs do not influence wine quality to nearly the same degree, so they can be considered comparatively inconsequential. These variables change quality only to a certain extent: adding an excessive amount of alcohol or another component will obviously lower a wine’s quality. Making quality wine is a great balancing game. These physicochemical components alter a wine’s quality when added or removed in small amounts, not in excess, so the rankings should be applied on a small scale.
With help from the correlation chart, it can be seen that some of the components are related. Citric acid and fixed acidity, density and fixed acidity, and total sulfur dioxide and free sulfur dioxide have relatively strong relationships, while pH and fixed acidity, pH and citric acid, and alcohol and density have moderately strong inverse relationships.